Inflated Speedups in Parallel Simulations

via malloc()

David  M.  Nicol* 

College  of  William  and  Mary 


Abstract

Discrete-event simulation programs make heavy use of dynamic memory allocation
in order to support simulation's very dynamic space requirements. When programming
in C one is likely to use the malloc() routine. However, a parallel simulation which
uses the standard Unix System V malloc() implementation may achieve an overly
optimistic speedup, possibly superlinear. An alternate implementation provided on some
(but not all) systems can avoid the speedup anomaly, but at the price of significantly
reduced available free space. This is especially severe on most parallel architectures,
which tend not to support virtual memory. This paper illustrates the problem, then
shows how a simply implemented user-constructed interface to malloc() can both
avoid artificially inflated speedups and make efficient use of the dynamic memory
space. The interface simply caches blocks on the basis of their size. We demonstrate
the problem empirically, and show the effectiveness of our solution both empirically
and analytically.
1 Introduction


Dynamic memory allocation plays an important role in the implementation of discrete-event
simulations. For example, in a queueing network simulation blocks of memory are
dynamically allocated and freed as the event list changes, and as jobs enter and exit the
network. When programming in C one invariably calls malloc() to request a block of dynamic
memory, and calls free() to release it. A programmer may not give a great deal of thought to
malloc()'s underlying implementation. Commonly used implementations search a linked
list of freed blocks for a match. As the length of the list grows, the cost of calling malloc()
grows. As we will show, this may lead to falsely optimistic speedup measurements of a
parallelized simulation (or any other program which makes heavy use of dynamic memory).
Other implementations have a cost that is nearly independent of the number of available
freed blocks. However, scalable implementations allocate blocks that are significantly larger
than the requested block size. Over-allocation poses little problem in a virtual memory
system where the effective size of the memory can be measured in gigabytes, but most
parallel architectures do not support virtual memory. An implementation that over-allocates
physical memory space reduces the size of the simulation model one can run.

This paper illustrates the problem, and offers a simple solution. The solution exploits
the fact that there are often only a few different sizes of blocks requested from malloc(). A
user may easily write an interface to malloc() and free() that caches freed blocks on the
basis of their size. A request to the interface for a block of size n searches the cache for a list
of freed blocks of size n. The space-efficient version of malloc() is called if the cache fails to
satisfy the request. The interface to free() rarely calls it. Instead, it places the freed
block into the cache. We show empirically and analytically that this solution “scales”: the
cost of requesting a block of memory is nearly independent of the number of freed blocks.
We also show that the solution permits the simulation of larger models than is possible using
the standard scalable (but space-inefficient) malloc().

This paper is organized as follows. §2 explains how standard implementations of malloc()
either cause false speedups, or allocate space inefficiently. §3 describes our solution. §4
presents empirical data that demonstrates the problem, and illustrates the effectiveness of
our solution. §5 analytically shows that the proposed space management algorithm scales
with problem size. §6 summarizes this paper.
2 The Problem

The standard Unix System V implementation of malloc() [10] maintains a linked list of all
freed blocks, ordered linearly by memory address. Each block records its size and location.
Because of the ordering, free() can quickly determine whether a newly freed block can be
merged with a physically adjacent block. A request to malloc() is satisfied by scanning
the free list until a block of sufficient size is found. This block is split in two; one subblock
is returned to satisfy the malloc() request, while the other remains in the free list. The
average time required to complete a malloc() call depends on the average length of the
list. Larger simulation models will tend to demand more dynamic memory and fragment the
dynamic memory space more, thereby causing more costly malloc() calls.

Let us now characterize the “size” S of a simulation model in terms of the average number
of dynamic memory blocks that have been allocated and not yet freed at any given instant.
If the simulation has for some time constantly requested and freed blocks randomly, then
the number of blocks in the freed list will be proportional to S, and the average cost of a
malloc() call will be proportional to g(S), where g is some increasing function. Let us also
characterize the simulation workload in terms of N, the total number of malloc() calls it
makes. If N is very large compared to the number of calls that scrambled the freed list,
the simulation's execution time will be proportional to N g(S). Now suppose that the same
simulation has been distributed among P processors. If the workload is evenly balanced each
processor receives 1/P-th of the simulation model, and 1/P-th of the simulation activity.
This means that the size of the simulation at one processor is S/P, so that the cost of a
malloc() call is proportional to g(S/P). Furthermore, the number of malloc() calls it
performs is N/P. If all P processors execute in parallel, the time required to perform the
N malloc() calls is proportional to (N/P) g(S/P), a speedup of order P g(S)/g(S/P) over
the serial implementation. Since g(S) > g(S/P) the speedup is superlinear.
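As an illustration only, suppose the cost function were affine, g(S) = a + bS with constants a, b > 0. This particular form is an assumption made here for the sake of a worked example; the argument above requires only that g be increasing. The speedup expression then becomes:

```latex
\[
\frac{N\,g(S)}{(N/P)\,g(S/P)}
  \;=\; P\,\frac{a + bS}{a + bS/P}
  \;=\; P^{2}\,\frac{a + bS}{Pa + bS}.
\]
% When the fixed cost dominates (bS \ll a) the ratio tends to P (linear speedup);
% when list scanning dominates (bS \gg Pa) it tends to P^2.
```

Under this assumed cost model the measured speedup therefore lies between P and P^2, depending on how strongly the list-scanning term dominates.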

It is important to get an accurate measurement of speedup, because only then can we
assess the benefit of parallelism to the end user. One may assume that in a serial context
the user will execute an optimized implementation; when possible, speedups should be
measured against an optimized serial solution. This is not always practical, and so speedups
are sometimes measured against one-processor implementations of the parallel algorithm.
The research community seems to accept this practice, provided that the complexity of the
one-processor solution is the same as the optimized serial solution. In this way the serial
solution is asymptotically optimal to within a constant factor. For example, a massively
parallel sorting algorithm may require O(n^2) comparisons, but use O(n^2) processors to achieve a
fast solution. Computation of speedup based on an O(n^2) serial sorting algorithm is frowned
upon. The speedups we study are computed from one-processor implementations of a
parallel approach. Based on experimentation with queueing networks, we estimate that the
serial timings we obtain are no more than twice as large as those of highly tuned optimal
implementations.

Use of a non-optimal (in terms of complexity) serial implementation is often the
underlying cause of overly optimistic measurements of speedup¹. For example, practitioners of
parallel simulation have observed superlinear speedup in early development phases of their
algorithms, due to the use of quickly implemented linearly-ordered event lists. Most (but
not all; see [3]) realize that performance measurements taken under these conditions are
meaningless: any log-time priority list scheme will accelerate both the serial and parallel
implementations, and not exhibit superlinear speedups. Careful researchers of parallel
simulation ensure that their event list algorithms and synchronization mechanisms scale properly
as the number of processors changes.

Observation of superlinear speedup is a clear indication of a problem. More insidious is
the case where a non-optimal serial implementation causes speedup to be inflated, but not
superlinear. One may be tempted to accept good speedups at face value, without questioning
possible inflation. We identify a simple metric, total workload, which reveals the presence
of inflated speedups. Total workload is simply the sum of the execution time of all “useful”
simulation work, in all processors. Inflated speedups of the type induced by malloc() are
recognized when total workload decreases radically as the number of processors is increased.

In theory one can always defeat inflated speedups by constructing a serial algorithm
which emulates the parallel. That option is not so easily chosen for our problem, as delving
into the system-level details of dynamic space management is not an activity for the
faint-of-heart. One solution exists in the form of a different implementation of malloc(), which
is standard under Berkeley Unix systems and is usually offered as an option² under System
V. The size of each block is of the form 2^j - 4 bytes. This form results from a partitioning
of the dynamic memory in blocks of powers of two; malloc() reserves four bytes in each
block for its own use. A list of free blocks is maintained for each possible size. The size
of a requested block is rounded up to the nearest available size, and a free block from that
list is returned. Should that list be empty, a larger block is returned. The proper block
list is found after a few shifts, and constant-time unlinking operations release the block.
However, if the requested block sizes are uniformly random, the size of an average request
will fall half-way between two block sizes. On average, a third of the allocated space will
be wasted. Node processors on most parallel architectures do not support virtual memory.
The over-allocation comes from physical memory, thereby reducing the size of the simulation
model that can be evaluated on the machine. However, the cost of calling this version of
malloc() is nearly independent of the number of outstanding memory blocks. We will refer
to this version as the scalable malloc().

¹ See [2] for an interesting classification of causes of superlinear speedup.

² This is not always the case. At the time of this writing the operating system delivered with the Intel
iPSC/860 does not include this option.
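The rounding rule of the power-of-two allocator can be sketched in C as follows. This is only an illustration of the size arithmetic described above, not the Berkeley source: the function names, the smallest class of 8 bytes, and the absence of overflow handling for very large requests are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define HEADER_BYTES 4u   /* bytes reserved per block by the allocator (from the text) */

/* Smallest power of two 2^j such that 2^j - HEADER_BYTES >= request. */
size_t class_block_bytes(size_t request)
{
    size_t block = 8;                 /* assumed smallest class: 2^3 bytes */
    assert(request > 0);
    while (block - HEADER_BYTES < request)
        block <<= 1;                  /* move up to the next size class */
    return block;
}

/* Usable bytes handed to the caller for a given request. */
size_t class_usable_bytes(size_t request)
{
    return class_block_bytes(request) - HEADER_BYTES;
}

/* Usable bytes that the caller did not ask for. */
size_t class_wasted_bytes(size_t request)
{
    return class_usable_bytes(request) - request;
}
```

For instance, assuming 4-byte words, a 32-byte (8-word) request does not fit in the 32-byte class (28 usable bytes) and is therefore served from the 64-byte class, which provides 60 usable bytes, or 15 words.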

The problem then is to find a way of managing dynamic space that avoids inflated
speedups, and which makes efficient use of space. As is so often the case in computer
science, the answer lies in caching.

3  Caching  Freed  Blocks 

Any caching scheme relies on some locality property, usually related to memory addresses
and the temporal pattern of accesses to them. The locality we exploit is that of size: the size
of a requested block tends to be one of only a few sizes requested throughout the simulation.
Any given simulation will have a number of object types for which it creates and destroys
instances; in our experience the number of different types (and hence object sizes) often is
not large. We may therefore emulate scalable malloc() and maintain (at the application
level) a list of freed blocks for each frequently used block size. We will suppose there is a
maximum number L of lists we will maintain in the cache. A similar scheme was proposed
some years ago for the caching of procedure frames in the Mesa system [5].

A request for a block of size n is handled by first searching to see if a list for size n blocks
is present in the cache. If it is, and if there is a free block of size n, the block is de-linked
and returned. If the cache contains an empty list for size n we call malloc() to supply a
block. Failure to find a size n list results in the creation of one. In our own applications it
is very rare to require more than ten different sized blocks. The code we present and the
implementations we test all set an upper bound on the number of lists. The interface can be
modified to support applications whose size requirements change dynamically: if the number
of existing lists equals L at the time a new list is required, we can replace an existing list.
An LRU (Least-Recently-Used) policy may govern replacement. The list “touched” most
distantly in the past is selected, and all of its blocks may be returned via free().

Figure 1 gives the source code for our implementation of ssmalloc() and ssfree(), the
scalable space-efficient dynamic memory routines. These routines use the first word in the
block either as a link (when in the free list), or to store the block size (when allocated). The
versions shown are terse; our actual implementations include error and sanity checks. Other
implementations may be more efficient when the number of lists is larger; for example, one
might hash on the size of the requested block.
#include <stdlib.h>

#define MAXPTRS 10

struct BufferPtrStruct {
    int length;                 /* block size served by this list; 0 = unused slot */
    char **ptr;                 /* head of the free list for this size */
} BufferPtr[MAXPTRS];

/* Terse version: no check that i stays below MAXPTRS or that malloc() succeeds. */
char *ssmalloc(int size)
{
    char **ptr; int i = 0;

    while (BufferPtr[i].length && size != BufferPtr[i].length) i++;
    BufferPtr[i].length = size;             /* in case this is new */

    if (BufferPtr[i].ptr)                   /* list non-empty? */
    {   ptr = BufferPtr[i].ptr;             /* get block request */
        BufferPtr[i].ptr = (char **)*ptr;   /* delink free block */
    }
    else ptr = (char **)malloc(size + sizeof(char **));

    *(int *)ptr = size;                     /* record size in the first word */
    return (char *)(ptr + 1);               /* user data starts after that word */
}

void ssfree(char **ptr)
{
    int size, i = 0;

    ptr--; size = *(int *)ptr;              /* back up to size field */
    while (BufferPtr[i].length && size != BufferPtr[i].length) i++;
    *ptr = (char *)BufferPtr[i].ptr;        /* link block onto its free list */
    BufferPtr[i].ptr = ptr;
}


Figure  1:  Space-Efficient  Scalable  Dynamic  Memory  Routines 
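The remark that one might hash on the requested size when the number of lists is large can be sketched as below. This is a hypothetical variant, not the implementation measured in this paper: the names hs_malloc, hs_free and find_slot, the table size NSLOTS, and the use of linear probing are all assumptions. Like Figure 1, it keeps one header word per block, so user data is aligned only to the size of that word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NSLOTS 64u                      /* assumed table size, a power of two */

typedef union hdr {                     /* one word in front of each block */
    size_t     size;                    /* while allocated: requested size */
    union hdr *next;                    /* while cached: next free block   */
} hdr;

static struct { size_t size; hdr *head; } slot[NSLOTS];

/* Find (or claim) the table slot for this size using linear probing. */
static unsigned find_slot(size_t size)
{
    unsigned i = (unsigned)(size & (NSLOTS - 1u));
    unsigned probes = 0;
    while (slot[i].size != 0 && slot[i].size != size) {
        i = (i + 1u) & (NSLOTS - 1u);
        probes++;
        assert(probes < NSLOTS);        /* sketch: the table must not fill up */
    }
    slot[i].size = size;                /* claims the slot if it was unused */
    return i;
}

void *hs_malloc(size_t size)
{
    unsigned i;
    hdr *h;
    assert(size > 0);                   /* size 0 marks an unused slot */
    i = find_slot(size);
    h = slot[i].head;
    if (h != NULL) {
        slot[i].head = h->next;         /* reuse a cached block */
    } else {
        h = malloc(sizeof(hdr) + size); /* cache miss: fall back to malloc() */
        if (h == NULL) return NULL;
    }
    h->size = size;                     /* record the size for hs_free() */
    return h + 1;
}

void hs_free(void *p)
{
    hdr *h = (hdr *)p - 1;              /* back up to the header word */
    unsigned i = find_slot(h->size);
    h->next = slot[i].head;             /* push onto the list for its size */
    slot[i].head = h;
}
```

A block released with hs_free() is the next one returned by hs_malloc() for the same size, so repeated allocate/free cycles of a given size never reach malloc() after the first miss.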


4  Empirical  Studies 


We now present empirical evidence that inflated speedups due to malloc() can occur. First
we demonstrate that in theory speedups can be inflated by as much as an order of magnitude.
This extreme case is achieved when dynamic memory management routines completely
dominate the computation. We then examine the problem in the context of a working parallel
simulation system, YAWNS (Yet Another Windowing Network Simulator) [6, 7]. YAWNS
provides a common platform for the parallel simulation of many different types of networks.
The demands on dynamic memory come primarily from the handling of small “logical
messages” passed between network elements, from dynamic event creation/deletion, and from
internal bookkeeping activities. The user defines the messages and the message sizes. In the
simulation models we have developed the number of different block sizes is less than ten. We
examine the performance of YAWNS on two different simulation problems. Both illustrate
the phenomenon of inflated speedups due to malloc(); one exhibits superlinear speedup.

We are interested in three performance characteristics: raw finishing time, behavior of
the speedup curve, and maximal simulatable problem size. We will look at these
characteristics as measured using standard System V malloc(), using scalable malloc(), and using
ssmalloc().

4.1 Superlinear malloc()

The potential for superlinear speedups is demonstrated by measuring the average cost of
calling malloc() (or free()) as a function of the “size” of the problem. Our experiments
show that on large problems, the average cost of calling malloc() is over 10 times greater
than the average cost on small problems. Therefore, if the problem can be split among enough
processors so that each has a “small” problem, speedups that are an order of magnitude larger
than linear might be observed.

We measured the average cost of malloc() and ssmalloc() on one node of the Intel
iPSC/2 [1] in the following way. To create a problem of size S we construct an array of S
pointers, which will point to blocks of dynamic memory. Each array position is assigned a
block of a given size. The possible sizes (in bytes) are 8, 16, 32, 64, 128, and 256. Assignment
of sizes to array positions is cyclic: slots 0, 6, 12, ... get size 8, slots 1, 7, 13, ... get size 16,
and so on. At initialization, an array position is either filled with a pointer to a block of
the appropriate size, or is left empty. The choice is made randomly, with equal likelihood
for either possibility. Next we iterate, making many passes over the array. On each pass, at
each array position, we randomly decide with equal likelihood whether or not to change the
status of the array position. If a decision to change the status is made, a non-empty array
pointer is changed by freeing the indicated block; the status of an empty array pointer is
changed by allocating a new block, a pointer to which is stored in the array location.

S models the size of a simulation problem. On average there will be S/2 blocks
allocated in the array, implying that the average number of freed blocks in malloc()'s list is
proportional to S/2. As S grows we expect the cost of calling malloc() to grow.

We  can  measure  the  time  required  to  iterate  a  given  number  of  times  over  the  array, 
then  count  the  number  of  calls  made  to  space  allocation  routines,  and  compute  the  average 
total  time  per  call.  However,  this  measurement  includes  overhead  due  to  looping,  testing, 
random  number  generation,  and  the  like.  The  overhead  can  be  accounted  for  by  performing 
an  identical  run  that  does  everything  except  call  the  space  allocation  routines.  The  actual 
average  cost  of  calling  a  space  allocation  routine  is  computed  by  taking  the  difference  in 
timings for the two runs, and dividing by the number of routine calls.
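The measurement procedure can be sketched as follows. This is a reconstruction from the description above, not the original test program: the function names, the small linear congruential generator, and the choice not to count the initial fills as calls are assumptions. Passing do_alloc = 0 gives the overhead run, which performs the same loops, tests and random draws but skips malloc() and free().

```c
#include <assert.h>
#include <stdlib.h>

static const size_t block_size[6] = { 8, 16, 32, 64, 128, 256 };

/* Small generator so that both runs see exactly the same decisions. */
static unsigned long lcg_state;
static int coin_flip(void)
{
    lcg_state = lcg_state * 1103515245UL + 12345UL;
    return (int)((lcg_state >> 16) & 1UL);
}

/* Sweep an array of S slots `passes` times; slot i uses block_size[i % 6].
 * Returns the number of allocation-routine calls made during the sweeps
 * (or that would have been made, when do_alloc is 0). */
long run_passes(size_t S, int passes, int do_alloc, unsigned long seed)
{
    void **slot = calloc(S, sizeof *slot);
    char *full = calloc(S, 1);              /* 1 if the slot holds a block */
    long calls = 0;
    size_t i;
    int p;

    assert(slot != NULL && full != NULL);
    lcg_state = seed;

    for (i = 0; i < S; i++)                 /* fill about half the slots */
        if (coin_flip()) {
            if (do_alloc) slot[i] = malloc(block_size[i % 6]);
            full[i] = 1;
        }

    for (p = 0; p < passes; p++)
        for (i = 0; i < S; i++)
            if (coin_flip()) {              /* change this slot's status */
                if (full[i]) { if (do_alloc) free(slot[i]); }
                else if (do_alloc) slot[i] = malloc(block_size[i % 6]);
                full[i] = !full[i];
                calls++;
            }

    for (i = 0; i < S; i++)                 /* release whatever is still held */
        if (full[i] && do_alloc) free(slot[i]);
    free(slot);
    free(full);
    return calls;
}

/* Average cost per call: difference of the two timings over the call count. */
double avg_cost(double t_with, double t_without, long calls)
{
    return (t_with - t_without) / (double)calls;
}
```

In use, each call to run_passes() would be bracketed by a timer reading, once with do_alloc = 1 and once with do_alloc = 0 under the same seed, and the two elapsed times passed to avg_cost().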

Figure 2 plots the average cost of calling malloc() or free(), and the average cost of
calling ssmalloc() or ssfree(), as a function of the logarithm (base two) of the array size,
S. The cost of the scalable space-efficient routines is seen to be very nearly constant: it
rises slightly from 19 μsec to 21 μsec as S goes from 2^2 to 2^16. The cost of malloc() and
free() remains relatively constant at 16 μsec for S between 2^2 and 2^6. However, for S
larger than 2^6 the cost rises, reaching 186 μsec at S = 2^16. This is over 10 times slower
than its average cost at S = 2^2. Therefore, in the most extreme case it would be possible
to achieve speedup which is a factor of 10 larger than linear. This is unlikely to happen,
because other scalable costs are involved in the computation and will serve to mute the
effect of a non-scaling malloc(). But, as we will see, real applications can suffer inflated
and sometimes superlinear speedups due to malloc().

4.2  Inflated  Speedups  in  YAWNS 

Next we show how malloc() can inflate speedup measurements of a real application, the
YAWNS parallel simulation system. We will look at the performance characteristics of
YAWNS on two different simulation problems. The first simulates the movement of objects
through an abstract hypercube structure. An object resides at a hypercube node for a
random period of time, then randomly selects some node connected to its current one and moves
there. Nodes do not impose queueing, so any number of objects may reside concurrently at
a node. This simulation is interesting because it exhibits superlinear speedup. The second
simulation is of Conway's Game of Life. Speedups for this problem become inflated, but are
not (usually) superlinear. This problem also reveals how scalable malloc() limits the size
of problem one can simulate.

[Figure 2 plot omitted: average cost of malloc()/free() call (vertical axis) versus log2(S) (horizontal axis)]

Figure 2: Average cost of space allocation routines, as function of problem size


The object movement simulation was written as a simple driver to test YAWNS during
implementation and debugging. Our discovery of superlinear speedup on this problem led
to the inquiry and solution reported in this paper. Our suspicions fell upon malloc() only
after we had eliminated every other possibility.

We  will  use  one  specific  problem  instance  to  illustrate  superlinear  speedup,  although  this 
simulation  model  routinely  achieves  it  under  a  wide  variety  of  circumstances.  The  problem 
instance  moves  4096  objects  between  nodes  in  an  8  dimensional  hypercube  structure  (256 
nodes).  Each  object  resides  at  the  node  for  a  random  period  of  time,  composed  of  the 
constant  0.25  plus  an  exponential  random  variable  with  mean  1.  It  chooses  a  new  destination 
with  equal  likelihood  among  all  the  nodes  connected  to  its  current  one.  The  simulation 
terminates  when  the  simulation  time  reaches  100.  This  requires  the  processing  of  a  few 
hundred  thousand  events. 

Table 1 presents data taken from a representative problem run. The simulation is written
in such a way that given an initial random number seed, exactly the same events occur,
independently of the number of processors used. We present data from use of both malloc()
and ssmalloc(). The processor utilization figures come from independent on-processor
measurement of the total time spent performing overhead activity: interprocessor
communication, synchronization delays due to blocking, synchronization activity required by the
synchronization protocol, etc. The utilization figure thus represents the average fraction
of time a processor spends performing “useful” simulation work. This workload should be
independent of the number of processors used. In theory, one can always estimate the total
time spent executing useful workload by multiplying together the average utilization, the
finishing time, and the number of processors. We call this the Total Workload. Our tables
provide this calculation.
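The calculation is simple enough to state as code. The following is a minimal sketch, assuming utilization is supplied as a fraction between 0 and 1; the function names are illustrative and are not part of YAWNS.

```c
#include <assert.h>

/* Total workload = average utilization x finishing time x processors. */
double total_workload(double utilization, double secs, int processors)
{
    assert(utilization >= 0.0 && utilization <= 1.0 && processors > 0);
    return utilization * secs * (double)processors;
}

/* Speedup relative to the one-processor finishing time. */
double speedup(double secs_one, double secs_p)
{
    assert(secs_p > 0.0);
    return secs_one / secs_p;
}

/* Ratio of one-processor to P-processor total workload.  A value near 1
 * is expected; a value well above 1 suggests an inflated speedup. */
double workload_ratio(double w_one, double w_p)
{
    assert(w_p > 0.0);
    return w_one / w_p;
}
```

With the one-processor malloc() figures from Table 1 (851 seconds at 99% utilization) the product is about 842; with the 16-processor figures (45 seconds at 71% utilization) it is about 511.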

In practice, the true average processor utilization may be difficult to obtain, because
lack of good hardware timing mechanisms makes it difficult to measure small bursts of
overhead activity. An additional problem is faced by programs using optimistic synchronization
mechanisms such as Time Warp [4], because the processing of an event can be overhead,
if the event is later rolled back. A method for measuring utilization in a Time Warp
system is given in [8] (although they call this the effective utilization, allowing the definition
of “utilization” to include overhead). The method is based on measuring the time delay
associated with processing each event. If the events have small execution times relative to
the timer resolution, considerable error may creep into the utilization estimate. A side-effect
of YAWNS' global style of synchronization is that the overhead occurs in bursts isolated from
the processing of useful work. While clock granularity can still be an issue, it is less of a
problem than it would be under more asynchronous synchronization protocols.

The data clearly shows superlinear speedup under malloc(), and shows why the speedup
is inflated. The total workload on one processor is 65% larger than the total workload on
16; the serial version appears to be doing more work. The extra work can be attributed to
the larger cost of calling malloc() and free() on larger problems. Caching compares well
with malloc(). Not only is the total workload relatively constant (less than 5% deviation
between the 1 and 16 processor total workloads) so that the speedups behave properly, the
raw finishing times are better as well. One shouldn't expect the total workload measurement
to be exactly constant, as a fair amount of noise creeps into the timing process: the resolution
of the clock available to a program on the Intel iPSC/2 is coarse, at one millisecond.

Checking  the  relative  constancy  of  total  workload  is  a  useful  way  of  detecting  whether 
an  application  has  inflated  speedups.  Speedup  inflation  may  go  unnoticed  if  the  speedups 
are  sublinear.  However,  if  speedups  are  inflated,  measurement  of  the  total  workload  (when 
possible)  will  reveal  it.  This  is  the  case  with  the  second  YAWNS  application  we  consider, 
Conway’s  Game  of  Life. 

Processors    secs    utilization    Speedup    Total Workload
     1         851        99%          1.00          842
     2         414        88%          2.05          728
     4         171        86%          4.97          588
     8          83        81%         10.2           537
    16          45        71%         18.9           511

Performance data using malloc()


Processors    secs    utilization    Speedup    Total Workload
     1         493        99%          1.00          488
     2         250        96%          1.97          480
     4         132        89%          3.73          470
     8          71        82%          6.94          466
    16          41        71%         12.02          466

Performance data using ssmalloc()


Table 1: Performance measurements for moving object simulation


The Game of Life consists of a toroidal mesh of cells, each of which is either dead or
alive. Time progresses in unit steps. The state of a cell at time n is determined by a simple
rule. If the cell was alive at time n - 1, then it remains alive at n if and only if exactly three
of its immediate neighbors (at all 8 points of the compass) were alive at time n - 1. The
rationale is that if fewer than three neighbors are alive the cell dies of loneliness, while if
more than three neighbors are alive it dies of overcrowding. Similarly, a cell which was dead
at time n - 1 springs to life spontaneously at time n if it has exactly three live neighbors at
time n - 1.
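The update rule, exactly as worded above, can be sketched in C. The board layout (a row-major array of bytes) and the function names are assumptions made for illustration; the sketch shows only the rule on a toroidal mesh, not the event-driven formulation used in YAWNS.

```c
#include <assert.h>

/* Count live cells among the 8 neighbors of (r, c) on a rows x cols torus. */
int live_neighbors(const unsigned char *board, int rows, int cols, int r, int c)
{
    int dr, dc, count = 0;
    assert(rows > 0 && cols > 0);
    for (dr = -1; dr <= 1; dr++)
        for (dc = -1; dc <= 1; dc++) {
            int rr, cc;
            if (dr == 0 && dc == 0) continue;   /* skip the cell itself */
            rr = (r + dr + rows) % rows;        /* wrap around the mesh */
            cc = (c + dc + cols) % cols;
            count += board[rr * cols + cc] ? 1 : 0;
        }
    return count;
}

/* Rule as stated in the text: a cell is alive at time n exactly when
 * three of its neighbors were alive at time n - 1. */
int next_state(int live_count)
{
    return live_count == 3;
}
```

Because the mesh is toroidal, a cell on the last row or column counts cells on the first row or column among its neighbors.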

One usually thinks of the Game of Life in the context of cellular automata, but
discrete-event simulation provides an efficient mechanism for performing the computation. The events
are re-evaluations of a cell's state. Whenever a cell changes state it sends a message to each
of its 8 neighbors informing them of the change. A cell which receives a change of state
message must re-evaluate its own state, as its environment has changed.

The Game of Life consumes a great deal of dynamic memory on large boards, owing to
the high number of messages that a cell sends when its state changes. It is therefore a good
problem for illustrating the shortcomings of scalable malloc(). We measured the largest
board size that could be simulated for 25 time-steps without exhausting memory, given a
random initial assignment of cell states where each cell is alive with probability 0.2. Of
course, the largest board size possible depends on the initial assignment, but one run on 16
processors is fairly representative of the others. On the representative run, the malloc()
implementation was able to simulate a 357 x 256 cell board. Under ssmalloc() it handled
board sizes up through 315 x 256 cells, while under scalable malloc() it failed after a board
of size 117 x 256.

Standard malloc() permits the simulation of a board which is 13% larger than the largest
one permitted under ssmalloc(). This is explained almost exactly by the fact that intercell
messages are 7 words long. The space for these messages always comes originally from
malloc(), but ssmalloc() asks it for 8 words; the extra one is used by ssmalloc() and
ssfree() for linking, and for storing the message size. Thus, on this problem ssmalloc()
suffers a 12.5% space overhead. It is possible to eliminate this overhead, for a price. malloc()
writes its own secret information in the first word before the one returned. We could overwrite
that word for our own purposes, but then could never return the block to free(), as we would
need to if ssmalloc() were modified to support more than L different list sizes.
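The arithmetic behind these figures can be checked directly from the numbers quoted above. The last ratio is an added point of comparison, assuming (hypothetically) that every allocated block were an intercell message:

```latex
\[
\text{space overhead} = \frac{8 - 7}{8} = 12.5\%,
\qquad
\frac{357 \times 256}{315 \times 256} = \frac{357}{315} \approx 1.133,
\qquad
\frac{8}{7} \approx 1.143 .
\]
```

The observed 13% difference in board size is thus close to the roughly 14% that the one-word overhead alone would predict under that assumption.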

The failings of scalable malloc() are clear: over a 50% reduction in the size of the maximum
simulatable problem. That the degradation is so much greater than the 33% "average"
one infers from a quick average case analysis is largely due to the primary message size. Scalable
malloc() uses a block of 15 words to satisfy ssmalloc()'s request for 8 words. Also,
the quick average case analysis does not account for the fact that a failure to find the smallest
block which contains a request causes the selection of blocks that are 4 or more times larger
than the size of the request.

Table 2 presents performance measurements obtained using malloc(), scalable malloc(),
and ssmalloc() on a 64 x 64 cell board, simulated for 50 steps, where the initial probability of a
cell being alive is 0.2. 64 x 64 is the largest power-of-two sized board that a single processor is
capable of handling using malloc() or ssmalloc(). Scalable malloc() requires the memory
of four processors to simulate a board of that size, for that long. Lacking a serial timing,
we omit speedup calculations for scalable malloc(). However, it is clear that ssmalloc()
is approximately 20% faster than scalable malloc(), as well as being more space efficient.


Processors   secs   utilization   Speedup   Total Workload
     1        197       99%         1.00         195
     2         91       85%         2.16         155
     4         50       68%         3.91         137
     8         29       61%         6.87         139
    16         17       51%         11.7         137

Performance data using malloc()

Processors   secs   utilization   Total Workload
     4         36       79%            133
     8         20       72%            116
    16         12       63%            120

Performance data using scalable malloc()

Processors   secs   utilization   Speedup   Total Workload
     1         97       99%         1.00          96
     2         50       90%         1.95          99
     4         28       78%         3.45          88
     8         16       68%         5.93          89
    16         10       58%         9.63          93

Performance data using ssmalloc()

Table 2: Performance measurements for Game of Life simulation


5 Analysis

A simple analytic model supports the observed near-constant cost of ssmalloc(). We model
the behavior of a single list of commonly sized blocks as a probabilistic birth-death process,
and show that the average number of transitions between expensive calls to malloc() grows
exponentially fast. Even if the cost of calling malloc() is linear in the number of outstanding
memory blocks, there are exponentially many “cheap” ssmalloc() calls between the linearly
expensive ones. The expensive calls are therefore amortized over so many cheap calls that
the average cost of calling ssmalloc() is nearly constant.

Let $T_1, \ldots, T_L$ be the set of lists, holding blocks of size $s_1, \ldots, s_L$ respectively. Consider
the sequence of ssmalloc() and ssfree() calls to all lists on one processor. We will call
this the complete sequence. We can always filter the complete sequence and consider only
those calls for blocks of size $s_j$. We suppose that this stream forms a simple Markovian
birth-death process whose state is the number of dynamic blocks allocated by ssmalloc()
but not yet released. From a non-zero state, with probability $p_j < 0.5$ a call in the filtered
stream for $T_j$ will be to ssmalloc(); with probability $q_j = 1 - p_j$ it will be to ssfree(). If
all requested blocks have been returned, then the state is zero and the next call must be to
ssmalloc().

A non-constant cost of calling ssmalloc() occurs whenever the appropriate list is empty.
This event coincides with the Markov chain achieving a record, or new maximal state. The
$i$th record $R_{i,j}$ for list $T_j$ is defined simply to be the number of chain transitions that occur
before state $i$ is reached for the first time. We are interested in $E[R_{i,j}] - E[R_{i-1,j}]$, for
$i = 2, 3, \ldots$, as these differences indicate how often, on average, ssmalloc() must call
malloc().

Let $S_{n,j}$ denote the total number of blocks associated with $T_j$, either explicitly in the list
or still allocated, at the $n$th complete ssmalloc() or ssfree() call. $S_{n,j}$ is just the index
of the last record defined for this list, e.g. $S_{n,j} = k$ if $R_{k,j}$ is the largest record such that
$R_{k,j} < n$. We assume that the average cost of calling malloc() at the $n$th complete call
is an increasing sublinear function $g$ of the total number of allocated but unfreed blocks at
the $n$th complete call: $O(E[g(\sum_{j=1}^{L} S_{n,j})])$. The point we will establish is that this
non-constant cost increases by only $O(1)$ every time malloc() is called, or equivalently, every
time some list achieves a new record. We will show that for each $T_j$, the expected number
of calls between records ($E[R_{i,j} - R_{i-1,j}]$) grows exponentially in $i$. This implies that the
number of references between calls to malloc() grows exponentially as $i$ increases, so that
each “expensive” ssmalloc() call is amortized over exponentially many constant-cost calls.
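The record structure of this model can be explored numerically. The following C sketch simulates the filtered birth-death chain for a single list and estimates the average number of transitions between consecutive records. It is an illustration only: the function names, the fixed bound of 64 states, and the xorshift generator are assumptions introduced here, not part of the analysis.

```c
#include <stdint.h>

/* Small deterministic generator (xorshift64) so that runs are repeatable. */
static uint64_t rng_state = 88172645463325252ULL;

static double uniform01(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return (double)(rng_state >> 11) / 9007199254740992.0;   /* divide by 2^53 */
}

/* Run one chain from state 0 until state max_state is first reached.
 * For every record i achieved, add the number of transitions since the
 * previous record to gap_sum[i]. */
static void run_chain(double p, int max_state, double *gap_sum)
{
    int state = 0;               /* blocks allocated but not yet released */
    int record = 0;              /* largest state seen so far             */
    long steps_since_record = 0;

    while (record < max_state) {
        if (state == 0 || uniform01() < p)
            state++;             /* an ssmalloc() call */
        else
            state--;             /* an ssfree() call   */
        steps_since_record++;
        if (state > record) {    /* list was empty, so malloc() is called */
            record = state;
            gap_sum[record] += (double)steps_since_record;
            steps_since_record = 0;
        }
    }
}

/* Estimate the mean number of transitions between records i-1 and i,
 * averaged over `trials` independent chains. Returns -1.0 on bad input. */
double mean_record_gap(double p, int i, long trials)
{
    double gap_sum[64] = {0};
    long t;

    if (i < 1 || i >= 64 || trials <= 0)
        return -1.0;
    for (t = 0; t < trials; t++)
        run_chain(p, i, gap_sum);
    return gap_sum[i] / (double)trials;
}
```

Calling mean_record_gap() for increasing i with a fixed p below 0.5 gives an empirical view of how quickly the gaps between calls to malloc() widen.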

We now derive an expression for $E[R_{i,j} - R_{i-1,j}]$, for any list $T_j$. The only way to reach
state $i$ the first time is through state $i-1$. It requires $R_{i-1,j}$ transitions to reach $i-1$ for the
first time. Then, a Bernoulli trial with probability $p_j$ determines whether state $i$ is achieved
in the next transition. In fact, the number of times after the first that the chain touches
state $i-1$ before stepping up to state $i$ is a geometric random variable $G$ minus 1, where
$E[G] = 1/p_j$. Each time the chain fails to step up from $i-1$ to $i$ it wanders off in the lower
indexed region of the state-space before returning. The number of transitions involved in
each wandering away is a random variable with mean $\mu_{i-1}$. Each wandering is independent
and identically distributed as any other, and $G - 1$ is a "stopping time" for the sequence of
wanderings. If the number of transitions in the $k$th wandering is denoted by $W_k$, then

$$R_{i,j} = R_{i-1,j} + \sum_{k=1}^{G-1} W_k + 1.$$

Applying Wald's lemma [9] to the random sum and rearranging we find that

$$E[R_{i,j}] - E[R_{i-1,j}] = \left(\frac{1}{p_j} - 1\right)\mu_{i-1} + 1. \qquad (1)$$
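The step from the random sum to equation (1) can be spelled out as follows. This is a sketch of the standard argument, writing $W_k$ for the number of transitions in the $k$th wandering (each with mean $\mu_{i-1}$) and using $E[G] = 1/p_j$ and $q_j = 1 - p_j$ from the text:

```latex
\begin{align*}
E\Big[\sum_{k=1}^{G-1} W_k\Big]
  &= E[G-1]\,E[W_1]
     && \text{(Wald's lemma)} \\
  &= \Big(\frac{1}{p_j} - 1\Big)\mu_{i-1}
   = \frac{q_j}{p_j}\,\mu_{i-1}, \\
E[R_{i,j}] - E[R_{i-1,j}]
  &= \frac{q_j}{p_j}\,\mu_{i-1} + 1 .
\end{align*}
```

The final "+1" is the single successful transition from state $i-1$ up to state $i$.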
We will derive $\mu_{i-1}$ from the fact that the mean time between visits to a state $k$ in an ergodic
Markov chain is equal to the reciprocal of the limiting occupancy probability of state $k$: $1/\pi_k$
[9].

The subchain in which the wandering occurs is simply a birth-death process with reflecting
states at 0 and $i-1$. The occupancy probability of state $i-1$ is derived using standard
techniques. First, the local balance equations are set up:

$$\pi_0 = q_j \pi_1,$$
$$p_j \pi_k = q_j \pi_{k+1} \quad \text{for } k = 1, \ldots, i-3,$$
$$p_j \pi_{i-2} = \pi_{i-1},$$


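from which it follows that

$$\pi_{i-1} = \left(\frac{p_j}{q_j}\right)^{i-2} \pi_0,$$

or equivalently,

$$\mu_{i-1} = \frac{1}{\pi_{i-1}} = \left(\frac{q_j}{p_j}\right)^{i-2} \frac{1}{\pi_0}.$$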


As $i$ increases, the limiting probability $\pi_0$ decreases. Furthermore, since $(q_j/p_j) > 1$, it
follows that $\mu_{i-1}$ increases exponentially fast in $i$. Applying this observation to equation (1)
we see that the expected number of transitions between records grows exponentially in $i$.
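The growth claimed here can be checked by direct computation. The C sketch below solves the local balance equations of the reflected subchain numerically, takes the reciprocal of the occupancy probability of the top state to obtain the mean wandering time, and then evaluates equation (1). The function names mean_wander and expected_gap are hypothetical, chosen only for this illustration.

```c
/* Mean return time to the top state i-1 of the birth-death chain on
 * states 0..i-1 with reflecting states at 0 and i-1. The unnormalized
 * occupancy probabilities are built up from the local balance equations
 *   pi_{k-1} * P(k-1 -> k) = pi_k * P(k -> k-1),
 * and the result is 1 / pi_{i-1} after normalization. Requires i >= 2. */
double mean_wander(double p, int i)
{
    double q = 1.0 - p;
    double pi_k = 1.0;        /* unnormalized pi_0 */
    double total = pi_k;
    int k;

    for (k = 1; k <= i - 1; k++) {
        double up   = (k == 1)     ? 1.0 : p;   /* state 0 always steps up       */
        double down = (k == i - 1) ? 1.0 : q;   /* state i-1 always steps down   */
        pi_k = pi_k * up / down;
        total += pi_k;
    }
    return total / pi_k;
}

/* Equation (1): expected number of transitions between records i-1 and i. */
double expected_gap(double p, int i)
{
    return (1.0 / p - 1.0) * mean_wander(p, i) + 1.0;
}
```

For example, with p = 0.4 the sketch gives an expected gap of 4 transitions for i = 2 and 8.5 for i = 3, and the ratio between successive gaps approaches q/p = 1.5 as i grows.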

The cost of calling malloc() grows at most linearly in the total number of records
achieved by all the lists. However, for each list the expected number of transitions between
records is growing exponentially in the number of records. Consequently, on average the
sublinear cost of malloc() is amortized over sufficiently many constant-cost calls that the
asymptotic average cost of calling ssmalloc() is nearly constant.


6  Summary 

Dynamic space management is an important component of many discrete-event simulations.
When programming in C, one is likely to use malloc() to acquire blocks of free space. However,
commonly used versions of malloc() either induce inflated speedups, or overallocate
memory by as much as 50%. This paper gives empirical evidence of the problem, and then
proposes that dynamic memory blocks be cached on the basis of their size. We demonstrate
empirically and analytically that the proposed solution is effective.